Goto

Collaborating Authors

 lipschitz function


Muon Does Not Converge on Convex Lipschitz Functions

arXiv.org Machine Learning

Muon and its variants have shown strong empirical performance in a variety of deep learning tasks. Existing convergence analyses of Muon rely on smoothness assumptions, though arguably the most successful function class for developing deep learning methods (such as AdaGrad, Shampoo, Schedule-Free and more) has been the class of convex and Lipschitz functions. In this paper we question whether the classical convex Lipschitz model is a useful one for understanding Muon. Our answer is no. We show that Muon does not converge on the class of convex and Lipschitz functions, regardless of the choice of learning rate schedule. We also show that error feedback restores convergence of Muon and all the non-Euclidean subgradient methods with momentum. However, this theoretical fix using error feedback degrades the performance of Muon in two representative settings for image classification (CIFAR-10) and language modeling (nanoGPT on FineWeb-Edu 10B). Our conclusion is that convex Lipschitz theory, despite having a prominent role in the design of practical methods for deep learning, is not the most suited one for Muon. This suggests that Muon's success must come from structure absent from this model, most plausibly related to smoothness.



Contextual Pricing for Lipschitz Buyers

Neural Information Processing Systems

We investigate the problem of learning a Lipschitz function from binary feedback. In this problem, a learner is trying to learn a Lipschitz function $f:[0,1]^d \rightarrow [0,1]$ over the course of $T$ rounds. On round $t$, an adversary provides the learner with an input $x_t$, the learner submits a guess $y_t$ for $f(x_t)$, and learns whether $y_t > f(x_t)$ or $y_t \leq f(x_t)$. The learner's goal is to minimize their total loss $\sum_t\ell(f(x_t), y_t)$ (for some loss function $\ell$). The problem is motivated by \textit{contextual dynamic pricing}, where a firm must sell a stream of differentiated products to a collection of buyers with non-linear valuations for the items and observes only whether the item was sold or not at the posted price.




5d9e4a04afb9f3608ccc76c1ffa7573e-Supplemental.pdf

Neural Information Processing Systems

Sets and scalars are represented by calligraphic and standard fonts,6 respectively. Intuitively, if ฮฆ (w0) is a (ยตฮฆ,ฮฝฮฆ)-near-isometry, then one would expect ฮฆ to remain near-10 isometry forallnearby points. We start with the basic definition of Hermite polynomial and its properties. A bound on (2kvk + kฮดvk) is obtained in (A.41). Let z Rd denote a Gaussian random vector.


Initialization-Dependent Sample Complexity of Linear Predictors and Neural Networks

Neural Information Processing Systems

Clearly, in order for learning to be possible, we must impose some constraints on the size of the function class. One possibility is to bound the number of parameters (i.e., the dimensions of the matrix W), in which case learnability follows from standard VC-dimension or covering number arguments (see Anthony and Bartlett [1999]).


Failure of uniform laws of large numbers for subdifferentials and beyond

arXiv.org Machine Learning

We provide counterexamples showing that uniform laws of large numbers do not hold for subdifferentials under natural assumptions. Our results apply to random Lipschitz functions and random convex functions with a finite number of smooth pieces. Consequently, they resolve the questions posed by Shapiro and Xu [J. Math. Anal. Appl., 325(2), 2007] in the negative and highlight the obstacles nonsmoothness poses to uniform results.


Contextual Pricing for Lipschitz Buyers

Neural Information Processing Systems

We investigate the problem of learning a Lipschitz function from binary feedback. In this problem, a learner is trying to learn a Lipschitz function $f:[0,1]^d \rightarrow [0,1]$ over the course of $T$ rounds. On round $t$, an adversary provides the learner with an input $x_t$, the learner submits a guess $y_t$ for $f(x_t)$, and learns whether $y_t > f(x_t)$ or $y_t \leq f(x_t)$. The learner's goal is to minimize their total loss $\sum_t\ell(f(x_t), y_t)$ (for some loss function $\ell$). The problem is motivated by \textit{contextual dynamic pricing}, where a firm must sell a stream of differentiated products to a collection of buyers with non-linear valuations for the items and observes only whether the item was sold or not at the posted price.